Bioinformatics A Practical Guide to Next Generation Sequencing Data Analysis (Hamid D. Ismail)

or other strategies. There are several peak-calling programs that use their own algorithms

to define protein-binding sites in the genome by identifying regions where sequence reads

are enriched after mapping to a reference genome. The peak caller assumes that ChIP-Seq reads align in larger numbers to the sites of protein binding than to other regions of the genome (Figure 6.2).

Peak-calling programs use different strategies to compute the statistical significance of peaks in the binding sites. Some peak callers assume a Poisson or negative binomial distribution to model the read counts and to compute a p-value for the statistical significance of a peak with respect to the background. Since thousands of windows are tested, each generating its own p-value, the chance of making a Type I error (false positive) increases. Some peak callers therefore adjust the p-values for the number of windows tested by computing the false discovery rate (FDR). Other callers use the height of peaks over background without providing a statistical significance metric, and others use machine learning to generate statistical metrics that allow peak calling. Sequencing depth and library complexity are crucial for the statistical significance of the fold enrichment.
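
The idea of testing each window against a Poisson background and then correcting for multiple testing can be illustrated with a minimal Python sketch. It is not the algorithm of any particular peak caller: the window counts, the background rate, and the 0.05 cutoff are hypothetical values chosen for illustration, and the Benjamini-Hochberg procedure is assumed here as one common way of controlling the FDR.

```python
import math

def poisson_sf(k, lam):
    """Return P(X >= k) for X ~ Poisson(lam)."""
    if k <= 0:
        return 1.0
    # Accumulate P(X = 0), ..., P(X = k - 1), then take the complement.
    term = math.exp(-lam)
    cdf = term
    for i in range(1, k):
        term *= lam / i
        cdf += term
    return max(0.0, 1.0 - cdf)

def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that the
    # adjusted values never decrease with increasing rank.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted

# Hypothetical read counts in six genomic windows (illustrative only).
window_counts = [3, 5, 25, 4, 18, 6]
# Hypothetical expected number of background reads per window.
background_rate = 5.0

pvals = [poisson_sf(c, background_rate) for c in window_counts]
qvals = benjamini_hochberg(pvals)

# Windows whose adjusted p-value passes the (assumed) 0.05 threshold.
called = [i for i, q in enumerate(qvals) if q < 0.05]
print(called)
```

In this toy input, only the windows with 25 and 18 reads remain significant after adjustment; the windows with counts near the background rate do not.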

In general, most peak callers perform well. However, attention should be paid to the type of binding sites that a peak caller is suited for. For instance, some callers are good for TFs (e.g., SISSRs [3]), some are good for histone modifications, and some can handle both (e.g., MACS [2]). Some callers do not support paired-end libraries, although paired-end reads can be treated as single-end reads by using either the forward or the reverse reads, but not both. The HOMER [4] caller was developed as a tool for de novo motif identification from peak regions. JAMM [5] requires replicated samples to improve confidence in peak calling. We should also pay attention to how a caller handles broad and sharp peaks. Callers may merge close peaks, which may lead to the loss of some resolution. ChIP-Seq reads originating from histone modifications generate a broad peak signal that spans a large region. However, determining the region boundaries of histone enrichment is still a challenge for peak callers. In contrast to histone modifications, ChIP-Seq signals of TFs and Pol II exhibit sharp peaks, and therefore the peak callers suitable for TFs and Pol II should be able to identify those narrow regions. In the following, we will discuss the steps of the ChIP-Seq workflow with a worked example.
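
To make the point about merging close peaks concrete, the following minimal Python sketch merges peak intervals that lie within a chosen distance of each other. The function name `merge_close_peaks`, the `max_gap` parameter, and the coordinates are hypothetical and do not correspond to the interface of any specific peak caller.

```python
def merge_close_peaks(peaks, max_gap):
    """Merge (start, end) intervals on one chromosome whose gap is <= max_gap."""
    merged = []
    for start, end in sorted(peaks):
        if merged and start - merged[-1][1] <= max_gap:
            # Close enough to the previous peak: extend it.
            prev_start, prev_end = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end))
        else:
            merged.append((start, end))
    return merged

# Hypothetical peak coordinates on a single chromosome.
peaks = [(100, 250), (300, 420), (1000, 1100)]

narrow = merge_close_peaks(peaks, max_gap=0)    # keeps the peaks separate
broad = merge_close_peaks(peaks, max_gap=100)   # joins the first two peaks

print(narrow)
print(broad)
```

With a gap of 100 bp, the first two peaks collapse into one region and the two separate summits are no longer distinguishable, which is the loss of resolution described above. A larger gap suits broad histone signals, whereas a small gap preserves the narrow peaks typical of TFs.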

6.3.1 Downloading the Raw Data

For practicing with ChIP-Seq data, we will download data from the ENCODE project at “https://www.encodeproject.org”. The raw data is from a ChIP-Seq experiment with the accession number ENCSR000EZL, which includes three samples from the HeLa-S3 cell line. This cell line was established from the cervical adenocarcinoma of Henrietta Lacks, an African American woman who died on October 4, 1951, at the age of 31; her cells continue to have an impact on the world by making significant contributions to scientific progress and advances in human health. This ChIP experiment targeted DNA-directed RNA polymerase II subunit RPB1 (encoded by the POLR2A gene), the largest subunit of RNA polymerase II, the enzyme that synthesizes all mRNA in eukaryotes. It initiates transcription by allowing the single-stranded DNA template strand of the promoter of a targeted gene to position itself within its central active site. The mRNA is formed as the complementary transcript to the template